Database optimization through schema modification

ABSTRACT

A database optimizer collects statistics regarding applications accessing a database, and makes one or more changes to the database schema to optimize performance according to the collected statistics. In a first embodiment, the optimizer detects when a certain type of application accesses the database a percentage of time that exceeds a predefined threshold level, and if the data in the database is stored in a less-than-optimal format for the application, the data type of one or more columns in the database is changed to a more optimal format for the application. In a second embodiment, the optimizer detects when one type of application accesses a column a percentage of time that exceeds a first predefined threshold level and is less than a second predefined threshold level, and creates a new column in the database so the data is present in both formats.

CROSS-REFERENCE TO RELATED APPLICATION

This patent application is a continuation of “Database OptimizationApparatus and Method”, Ser. No. 11/277,229 filed on Mar. 22, 2006, whichis incorporated herein by reference.

BACKGROUND OF THE INVENTION

1. Field of the Invention

This invention generally relates to computer systems, and morespecifically relates to apparatus and methods for accessing data in acomputer database.

2. Background Art

Since the dawn of the computer age, computers have evolved and becomemore and more powerful. In our present day, computers have becomeindispensable in many fields of human endeavor including engineeringdesign, machine and process control, information storage and retrieval,and office computing. One of the primary uses of computers is forinformation storage and retrieval.

Database systems have been developed that allow a computer to store alarge amount of information in a way that allows a user to search forand retrieve specific information in the database. For example, aninsurance company may have a database that includes all of its policyholders and their current account information, including paymenthistory, premium amount, policy number, policy type, exclusions tocoverage, etc. A database system allows the insurance company toretrieve the account information for a single policy holder among thethousands and perhaps millions of policy holders in its database.

Many databases include data that has existed for decades, oftenoutliving the software applications that originally created the data.New applications are often developed that need to access the data. Theway that data is stored in a database affects the performance ofapplications that access the data. If the data is stored as a particulardata type, but an application requires a different data type, the datamust typically be read, then converted to the desired data type. Thisproblem arises, for example, when data that was originally created bylegacy applications in one data type now needs to be accessed by newlanguages and APIs that expect a different data type.

Legacy applications often store integer data in fields that are in aformat known as PACKED DECIMAL, which is one way to represent numericdata. Like the PACKED DECIMAL data type, the data types SMALLINT,INTEGER, and BIGINT are other alternative ways to represent numericdata. In a Java and JDBC programming paradigm, it is widely recognizedthat retrieval of data from database columns that have a type ofSMALLINT, INTEGER, or BIGINT will perform significantly better thanretrieval of data from a column that has a type PACKED DECIMAL. Theformat of the data stored in the database results in performancepenalties for any application that needs to access the data, but couldrun faster if the data were of a different type. Changing the data typeto accommodate the new applications is generally not an acceptableoption, because changing the data type would require significant manualrework. This problem is especially apparent as companies with existingapplications and databases become internet-enabled, which is commonlydone using Java to access data in existing databases. Using Java toaccess data in less-than-optimal data types in older databases resultsin performance penalties that are significant. Without an apparatus andmethod that allows a database to dynamically evolve according to theapplications accessing its data, the computer industry will continue tosuffer from excessive overhead in porting existing data to newapplications, such as web-enabled applications.

BRIEF SUMMARY OF THE INVENTION

According to the preferred embodiments, a database optimizer collectsstatistics regarding which types of applications are accessing thedatabase, and makes one or more changes to the database schema tooptimize performance according to the collected statistics. In a firstembodiment, the optimizer detects when a certain type of applicationaccesses the database a percentage of time that exceeds a predefinedthreshold level, and if the data in the database is stored in aless-than-optimal format for the application, the data type of one ormore columns in the database is changed to a more optimal format for theapplication. This means that the database optimizer must recognize whena different type of application requests data from any changed column,and must potentially perform a conversion from the new data type to theold data type before returning the requested data. In a secondembodiment, the optimizer detects when one type of application accessesa column a percentage of time that exceeds a first predefined thresholdlevel and that accesses the column a percentage of time that is lessthan a second predefined threshold level. In this case, a new column iscreated in the database so the data is present in both formats, therebyoptimizing the performance of both old and new applications that accessthe data. The database optimizer looks at what type of applicationrequested data, and returns the data in the format optimized for thattype of application.

The foregoing and other features and advantages of the invention will beapparent from the following more particular description of preferredembodiments of the invention, as illustrated in the accompanyingdrawings.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWING(S)

The preferred embodiments of the present invention will hereinafter bedescribed in conjunction with the appended drawings, where likedesignations denote like elements, and:

FIG. 1 is a block diagram of an apparatus in accordance with thepreferred embodiments;

FIG. 2 is a sample database query in Structured Query Language (SQL);

FIG. 3 is a block diagram showing the relationship between the databaseand database manager of FIG. 1, and between the database manager andapplications that need to access the database;

FIG. 4 is a flow diagram of a method in accordance with the preferredembodiments;

FIG. 5 is a sample menu display window that allows a user to setcustomization settings for the database manager of the preferredembodiments;

FIG. 6 is a sample menu display window that allows a user to setthreshold levels that determine the function of the database manager ofthe preferred embodiments;

FIG. 7 is a sample employee table for illustrating the function of thedatabase manager of the preferred embodiments;

FIG. 8 is the employee table of FIG. 7 after adding reflective columnsfor each of the columns in FIG. 7;

FIG. 9 is the employee table of FIG. 8 after deleting the originalcolumns in FIG. 7;

FIG. 10 is a table of tables that represents prior art metadata for theemployee table of FIG. 7;

FIG. 11 is a table of columns in the employee table that representsprior art metadata for the employee table of FIG. 7;

FIGS. 12 and 13 represent the table of tables and table of columns,respectively, that make up the metadata for the employee table in FIG. 7in accordance with the preferred embodiments;

FIGS. 14 and 15 represent the table of tables and table of columns,respectively, that make up the metadata for the employee table in FIG. 8in accordance with the preferred embodiments;

FIGS. 16 and 17 represent the table of tables and table of columns,respectively, that make up the metadata for the employee table in FIG. 9in accordance with the preferred embodiments;

FIG. 18 is a sample menu display window that allows a user to selectwhether the application view metadata or system view metadata isdisplayed when a command to display the metadata is executed;

FIG. 19 is a flow diagram of a method for a database manager to providedata of the correct type to a requesting application; and

FIG. 20 is a block diagram of the database optimizer in FIG. 1.

DETAILED DESCRIPTION OF THE INVENTION 1.0 Overview

The present invention relates to optimizing the performance of accessingdata in a database. For those not familiar with databases, this Overviewsection will provide background information that will help to understandthe present invention.

Known Databases and Database Queries

There are many different types of databases known in the art. The mostcommon is known as a relational database (RDB), which organizes data intables that have rows that represent individual entries or records inthe database, and columns that define what is stored in each entry orrecord.

To be useful, the data stored in databases must be able to beefficiently retrieved. The most common way to retrieve data from adatabase is to generate a database query. A database query is anexpression that is evaluated by a database manager. The expression maycontain one or more predicate expressions that are used to retrieve datafrom a database. For example, lets assume there is a database for acompany that includes a table of employees, with columns in the tablethat represent the employee's name, address, phone number, gender, andsalary. With data stored in this format, a query could be formulatedthat would retrieve the records for all female employees that have asalary greater than $40,000. Similarly, a query could be formulated thatwould retrieve the records for all employees that have a particular areacode or telephone prefix.

One popular way to define a query uses Structured Query Language (SQL).SQL defines a syntax for generating and processing queries that isindependent of the actual structure and format of the database. Onesample SQL query is shown in FIG. 2.

The “select *” statement tells the database query processor to selectall columns, the “from Table1” statement identifies which database tableto search, and the “where” clause specifies one or more expressions thatmust be satisfied for a record to be retrieved. Note that the query ofFIG. 2 is expressed in terms of columns C1, C2 and C3. Information aboutthe internal storage of the data is not required as long as the query iswritten in terms of expressions that relate to values in columns fromtables.

For the query of FIG. 2, the “where” clause specifies that the firstcolumn has a value equal to four (C1=4) logically ANDed with theexpression that the second column is greater than six OR the thirdcolumn is not equal to eight. In the prior art, much effort has beenexpended to optimize queries so they may be executed faster, whichincreases system performance. However, no known efforts have been madeto dynamically change a database's schema according to the type ofapplications accessing the database, the frequency with which theapplication access data in the database, and the location of the data inthe database accessed by the applications.

2.0 Detailed Description

The preferred embodiments provide a way to dynamically tune a databaseto provide data in a format optimized for the type of application thatmost frequently accesses the data.

Referring to FIG. 1, a computer system 100 is one suitableimplementation of an apparatus in accordance with the preferredembodiments of the invention. Computer system 100 is an IBM iSeriescomputer system. However, those skilled in the art will appreciate thatthe mechanisms and apparatus of the present invention apply equally toany computer system, regardless of whether the computer system is acomplicated multi-user computing apparatus, a single user workstation,or an embedded control system. As shown in FIG. 1, computer system 100comprises a processor 110, a main memory 120, a mass storage interface130, a display interface 140, and a network interface 150. These systemcomponents are interconnected through the use of a system bus 160. Massstorage interface 130 is used to connect mass storage devices (such as adirect access storage device 155) to computer system 100. One specifictype of direct access storage device 155 is a readable and writable CDROM drive, which may store data to and read data from a CD ROM 195.

Main memory 120 in accordance with the preferred embodiments containsdata 121, an operating system 122, a database 123, a database manager125, data access rules 128, and run-time statistics 129. Data 121represents any data that serves as input to or output from any programin computer system 100. Operating system 122 is a multitasking operatingsystem known in the industry as OS/400; however, those skilled in theart will appreciate that the spirit and scope of the present inventionis not limited to any one operating system. Database 123 is any suitabledatabase, whether currently known or developed in the future. Database123 comprises any suitable table or collection of tables defined bydatabase schema 124. Database manager 125 suitably includes one or moredatabase APIs 126 and a database optimizer 127. Database APIs 126 arethe application programming interfaces (APIs) that applications may useto access data stored within database 123. In the preferred embodiments,each type of programming paradigm includes its own set of APIs foraccessing data in the database 123. Data access rules 128 correlate aprogramming paradigm (such as COBOL or Java) to its preferred datatypes. Run-time statistics 129 contain statistics that indicate therelative frequency with which each programming paradigm accesses data ina particular portion of the database 123 (such as a selected column orcolumns). Database optimizer 127 monitors the data access rules 128 andthe run-time statistics 129, and makes one or more changes to thedatabase schema 124 to optimize the access of data in the databaseaccording to the run-time statistics 129. In some cases, the databaseoptimizer 127 changes the data type of one or more columns in thedatabase 123. In other cases, the database optimizer 127 adds reflectivecolumns to the database 123 so that data within the database 123 ispresent in multiple data types at the same time. The database optimizer127 then monitors which type of application requests access to the data(by determining which database API 126 is invoked), and retrieves datafrom a column in the database, if one exists, that stores the data in adata type that is optimized for the type of requesting application. Notethat run-time statistics 129 may be collected by the database manager125 within the scope of the preferred embodiments, or may be separatelycollected yet used by the database manager 125.

Computer system 100 utilizes well known virtual addressing mechanismsthat allow the programs of computer system 100 to behave as if they onlyhave access to a large, single storage entity instead of access tomultiple, smaller storage entities such as main memory 120 and DASDdevice 155. Therefore, while data 121, operating system 122, database123, database manager 125, data access rules 128, and run-timestatistics 129 are shown to reside in main memory 120, those skilled inthe art will recognize that these items are not necessarily allcompletely contained in main memory 120 at the same time. It should alsobe noted that the term “memory” is used herein to generically refer tothe entire virtual memory of computer system 100, and may include thevirtual memory of other computer systems coupled to computer system 100.

Processor 110 may be constructed from one or more microprocessors and/orintegrated circuits. Processor 110 executes program instructions storedin main memory 120. Main memory 120 stores programs and data thatprocessor 110 may access. When computer system 100 starts up, processor110 initially executes the program instructions that make up operatingsystem 122. Operating system 122 is a sophisticated program that managesthe resources of computer system 100. Some of these resources areprocessor 110, main memory 120, mass storage interface 130, displayinterface 140, network interface 150, and system bus 160.

Although computer system 100 is shown to contain only a single processorand a single system bus, those skilled in the art will appreciate thatthe present invention may be practiced using a computer system that hasmultiple processors and/or multiple buses. In addition, the interfacesthat are used in the preferred embodiment each include separate, fullyprogrammed microprocessors that are used to off-load compute-intensiveprocessing from processor 110. However, those skilled in the art willappreciate that the present invention applies equally to computersystems that simply use I/O adapters to perform similar functions.

Display interface 140 is used to directly connect one or more displays165 to computer system 100. These displays 165, which may benon-intelligent (i.e., dumb) terminals or fully programmableworkstations, are used to allow system administrators and users tocommunicate with computer system 100. Note, however, that while displayinterface 140 is provided to support communication with one or moredisplays 165, computer system 100 does not necessarily require a display165, because all needed interaction with users and other processes mayoccur via network interface 150.

Network interface 150 is used to connect other computer systems and/orworkstations (e.g., 175 in FIG. 1) to computer system 100 across anetwork 170. The present invention applies equally no matter howcomputer system 100 may be connected to other computer systems and/orworkstations, regardless of whether the network connection 170 is madeusing present-day analog and/or digital techniques or via somenetworking mechanism of the future. In addition, many different networkprotocols can be used to implement a network. These protocols arespecialized computer programs that allow computers to communicate acrossnetwork 170. TCP/IP (Transmission Control Protocol/Internet Protocol) isan example of a suitable network protocol.

At this point, it is important to note that while the present inventionhas been and will continue to be described in the context of a fullyfunctional computer system, those skilled in the art will appreciatethat the present invention is capable of being distributed as a programproduct in a variety of forms, and that the present invention appliesequally regardless of the particular type of computer-readable signalbearing media used to actually carry out the distribution. Examples ofsuitable computer-readable signal bearing media include: recordable typemedia such as floppy disks and CD ROM (e.g., 195 of FIG. 1), andtransmission type media such as digital and analog communications links.

The remainder of this specification describes the detailed function ofthe database optimizer 127 shown in FIG. 1. Referring now to FIG. 3, ablock diagram shows the relationship between the database 123, thedatabase manager 125, and applications 320 that require access to datastored in the database 123. Database 123 is defined by a database schema124 that specifies the details for each table 330 in the database 123,including the number of columns in the table, the width of each column,and the data type of each column. Note that the database schema 124includes a definition for each and every type of table 330 stored in thedatabase 123.

Database manager 125 is a layer of code that runs between theapplications 320 that need to access data in the database 123 and thedatabase 123 itself Database manager 125 includes the database optimizer127 shown in FIG. 1, and includes database APIs 126 that provideinterfaces for applications 320 to access data in the database 123. InFIG. 4, examples of suitable APIs are shown to include one or more COBOLAPIs 332, one or more Java APIs 334, and one or more RPG APIs 336. COBOLAPIs 332 are application programming interfaces that provide aninterface for COBOL applications to access data stored in database 123.Java APIs 334 are application programming interfaces that provide aninterface for Java applications to access data stored in database 123.RPG APIs 336 are application programming interfaces that provide aninterface for RPG applications to access data stored in database 123. Ofcourse, these specific APIs 332, 334 and 336 in FIG. 3 are shown by wayof example, and the preferred embodiments expressly extend to anysuitable API or other type of interface that allows an application toaccess data stored within database 123.

Applications 320 include all applications that may access data withinthe database 123. From the specific APIs 332, 334 and 336 shown in FIG.3, we assume that applications 320 would include COBOL applications,Java applications, and RPG applications. In the preferred embodiment,each type of application will have its own set of APIs that is uses toaccess data in the database 123. Note, however, that an application of aparticular type may access data in the database 123 using an APIintended for a different application, so long as the application itselfincludes the logic to perform any needed conversion between the datareturned by the API.

Referring now to FIG. 4, a method 400 in accordance with the preferredembodiments shows how the database optimizer 127 may dynamically makechanges to a database to tune the performance of the database accordingto the type of applications that are accessing its data. First, therun-time statistics for the database are read (step 410). Note that therun-time statistics are preferably collected by the database optimizer127, but may also be collected by a separate software tool orapplication as well. Method 400 determines from the statistics if thepercentage of accesses by a particular type of application that wouldbenefit a change of data type exceeds a first threshold (step 420). Ifnot (step 420=NO), no change is made to the database (step 422). If so(step 420=YES), method 400 checks to see if the percentage of accessesby a particular type of application that would benefit from a change ofdata type exceeds a second threshold (step 430). If not (step 430=NO),one or more new reflective columns are created in the database (step450). If so (step 430=YES), the data type of one or more columns in thedatabase is changed (step 440). In this manner, method 400 makesappropriate changes to the database in steps 440 and 450 that will makeaccesses to the data stored in the database perform better according tothe percentage of accesses by different types of applications. Note thatthe changes are determined by the first and second thresholds levels,which may be fixed, but are preferably variable and can be set by thedatabase administrator.

Note that the changes to the database performed by the databaseoptimizer 127 in method 400 of FIG. 4 are changes to data types forcolumns in the database (in step 440), and the addition of columns inthe database (step 450). Note also that the database optimizer 127 mayalso delete reflective columns once they are no longer needed. Thesetypes of changes to the database are describe herein in two differentways. First, these are describe as changes to the database (as in FIG.4). Alternatively, these changes are described as changes to thedatabase schema (as in the claims). Note that these are different waysof saying the same thing, since the database schema dictates thephysical structure and organization of the database. Columns can only beadded to a table by changing the schema for the table. The data type ofa column can only be changed by changing the schema that specifies thedata type for the column. For this reason, the terms “changing thedatabase” and “changing the database schema” are considered equivalent,and no difference between these terms exists for the purposes ofdescribing the preferred embodiments or claims herein.

In the preferred embodiments, the database administrator has the abilityto dictate how the database optimizer 127 functions. Referring to FIG.5, a menu display window 500 allows the database administrator to selectwhether the database optimizer is turned off, is put in an “advise only”mode, or is enabled to make changes to the database automatically. Ifturned off, the database optimizer 127 does not perform any of theoptimization functions described herein as part of the preferredembodiments. If in “advise only” mode, the database optimizer 127 showsto the database administrator what changes the database optimizer 127would have made to optimize performance of the database if automaticchanges were enabled. If automatic changes are enabled (as shown by thecheck in the box in FIG. 5), the database optimizer 127 automaticallymakes the changes to the database to optimize its performance accordingto the applications accessing its data, as shown by method 400 in FIG.4. In addition to setting the database optimizer 127 to “off”, “adviseonly”, or “automatic changes” as shown in FIG. 5, the databaseadministrator may also select the lower and upper threshold levels thatcontrol how the database optimizer 127 performs its functions, as shownin FIG. 6. The lower threshold determines when the database optimizer127 can first take action to optimize the database for accesses by aparticular type of application. The upper threshold determines when theaccesses by a particular type of application become so dominant that itjustifies changing the data type of one or more columns in the databaseto accommodate the dominant type of application. For the specificexample of FIG. 6, the lower threshold is set to 20%, while the upperthreshold is set to 80%. The fact that these two threshold levels inFIG. 6 sum to 100% is coincidental; any suitable values may beindependently selected for the first and second threshold values so longas the upper threshold value is equal to or greater than the lowerthreshold value.

Once a type of application that would benefit from a change of data typeexceeds 20% of the accesses to data within a particular column, thedatabase optimizer may take action to optimize the database. If the typeof application has a number of accesses between the lower and upperthresholds (between 20% and 80% for the example in FIG. 6), reflectivecolumns will be added to the database to provide data in multiple datatypes for multiple application types. Once the number of accesses for aparticular application type exceeds the upper threshold, the data typeof the columns in the database are changed to be optimized for theparticular application. Note that the data access rules 128 in FIG. 1correlate a type of application to its preferred data types that willoptimize performance for that particular type of application.

It is important to note what happens when data is stored in aless-than-optimal format in the database, and no reflective columns arepresent to provide an alternative data type for the data. This happenswhen the number of accesses by an application type is lower than thelower threshold, or is greater than the upper threshold. When anapplication type has a percentage of accesses lower than the lowerthreshold, the database optimizer 127 will not perform optimizations tothe database 123. Note, however, that the database manager 125 willstill return data in a format expected by a requesting application bydetecting which database API 126 was invoked, and by either returningthe data (if already of the desired data type), or automaticallyconverting the data to the desired data type before returning the datato the requesting application. Note that the correlation between aparticular type of application and its preferred data types is stored inthe data access rules 128. When an application type has a percentage ofaccesses greater than the upper threshold, the data type of one or morecolumns in the database will be changed to the preferred data types forthat type of application. Other types of applications that now requestdata from these columns expect a different data type. Again, thedatabase manager 125 accounts for this mismatch between data types, andperforms a conversion between data types before returning the data tothe requesting application. In this manner, the data type preferred bythe type of requesting application is always returned, and when noconversion is necessary, the performance of the API returning the datais significantly increased. Note that the data conversion may beperformed by the database optimizer 127, or may be performed by adifferent portion of code within the database manager 125.

Note that the lower and upper thresholds discussed herein can be eitherinclusive or exclusive of their boundary limits within the scope of thepreferred embodiments. Thus, we discuss the function of the databaseoptimizer 127 in method 400 of FIG. 4 as taking certain actions based onwhether the percentage of accesses exceeds a first threshold (step 420)or exceeds a second threshold (step 430). Note that these steps couldhave alternatively been specified to evaluate whether the percentage ofaccesses is greater than or equal to the first and second thresholds.There is no specific importance regarding where the lines are drawn andwhether the conditions are true when the percentage of accesses is equalto the set threshold levels. The preferred embodiments expressly extendsto any manner of defining a lower threshold and an upper threshold, andfor taking appropriate steps according to those defined thresholdlevels, regardless of whether the boundary limits defined by thethreshold levels are included or excluded in the ranges.

A very simple example is now presented to illustrate the function of thedatabase optimizer 127. Referring to FIG. 7, a table 700 is a verysimple database table referred to as an Employee table that stores thename of a company's employees and their corresponding employeeidentification numbers. The employee table 700 includes a first column710 that has a data type of char(20) and a label of “name”, and a secondcolumn 720 that has a data type of packed decimal (6,0) and a label of“id”. These data types are the preferred data types for a COBOLapplication, and we assume that table 700 was originally created using aCOBOL application. Now, let's assume that the company wants to make theemployee and identification numbers available to Java applications aswell as COBOL applications. In the prior art, when a Java applicationaccesses the data in table 700, the database manager 125 would convertthe data from its stored data type to data types that Java expects. Inthe preferred embodiments, the database manager 125 also performs theconversion between data types by knowing what kind of application isrequesting the data by determining which database API 126 was invoked,and by looking at the data access rules 128 that correlate a type ofapplication to its preferred data types. With this information, thedatabase manager 125 returns the preferred data type to the requestingapplication. It has been shown by extensive experience that Javaprocesses unicode much more efficiently than character text, andprocesses integers much more efficiently than packed decimals. Wetherefore assume that the data access rules 128 list char and packeddecimal as preferred data types for COBOL applications, while unicodeand int are preferred data types for Java applications. We assume forthis example that the lower and upper threshold levels are set at 20%and 80%, respectively, as shown in FIG. 6. For these threshold levels,for any percentage of accesses to a column by Java applications that areless than 20%, the database manager simply retrieves the data as storedin the table of FIG. 7, and performs the conversion between theretrieved data types and the preferred data types for Java (namely,unicode and int). For the table 700 of FIG. 7, a COBOL applicationrequests access to the “name” column in table 700 by invoking a COBOLAPI (e.g., COBOL API 332 of FIG. 3). The database manager 125 detectsthat a COBOL API was invoked, and looks at the data access rules 128 todetermine that the preferred data type for text is char. The databasemanager 125 then looks at the database schema 124 to determine if acolumn in table 700 has the “name” information in “char” format. Column710 has the “name” information in “char” format, so data from column 710is returned. In similar fashion, a Java application requests access tothe “name” column in table 700 by invoking a Java API (e.g., Java API334 of FIG. 3). The database manager 125 detects that a Java API wasinvoked, and looks at the data access rules 128 to determine that thepreferred data type for text is unicode. The database manager 125 thenlooks at the database schema 124 to determine if a column in table 700has the “name” information in “unicode” format. There is no column intable 700 that has the “name” information in “unicode” format, so thedatabase manager 125 retrieves the char(20) data from column 710,converts the char(20) data to unicode(20) data, and returns theunicode(20) data to the requesting Java application. In this manner, thedatabase manager 125 automatically converts from a stored data type todata type desired by the requesting application, when needed. However,the most significant advantage of the preferred embodiments is thereduction in the percentage of times a conversion is needed by changingthe database schema to store data in one or more formats compatible withthe type of applications that access the data most frequently.

Referring back to FIG. 4, once the percentage of accesses for aparticular type of application that would benefit from a change in datatype exceeds a first threshold (e.g., the lower threshold of 20%), thedatabase optimizer 127 may take action to optimize the database bychanging the database. If the percentage of accesses for the type ofapplication lies between the first and second threshold levels (e.g.,between the lower threshold of 20% and the upper threshold of 80%),reflective columns are created in the database (step 450 of FIG. 4).Once the percentage of accesses for the type of application exceeds thesecond threshold (e.g., is greater than 80%), some of the reflectivecolumns are deleted, leaving only the columns that contain the data typefor the dominant type of application accessing the data.

Let's say that the percentage of accesses for Java applications for boththe name and id columns in table 700 is 30%. This lies between the lowerthreshold of 20% and the upper threshold of 80%, so the databaseoptimizer 127 creates reflective columns (step 450 in FIG. 4), as shownin table 700A of FIG. 8. Note that the first two columns are the samecolumns 710 and 720 in FIG. 7. However, two new columns 810 and 820 areadded that have different names and data types. Column 810 of table 700Ahas a data type of unicode(20), and has a label of “name_ref”. The datain column 810 is represented by asterisks “*” to indicate that the datain this column is identical to the data in the “name” column 710, onlyit is stored in unicode(20) format rather than char(20) format.Similarly, the fourth column 820 of table 700A has a data type of int,and has a label of “id_ref”. The data in column 820 is represented byplus signs “+” to indicate that the data in this column is identical tothe data in the “id” column, only it is stored in int format rather thanpacked decimal (6,0) format. Note that columns 710 and 810 are said tobe “reflective” columns because they reflect the same data in differentdata types (or formats). Similarly, columns 720 and 820 are reflectivecolumns.

A COBOL application requests access to the “name” column in table 700Aby invoking a COBOL API (e.g., COBOL API 332 of FIG. 3). The databasemanager 125 detects that a COBOL API was invoked, and looks at the dataaccess rules 128 to determine that the preferred data type for text ischar. The database manager 125 then looks at the database schema 124 todetermine which column in table 700A has the “name” information in“char” format. Column 710 is the appropriate column, so data from column710 is returned. In similar fashion, a Java application requests accessto the “name” column in table 700A by invoking a Java API (e.g., JavaAPI 334 of FIG. 3). The database manager 125 detects that a Java API wasinvoked, and looks at the data access rules 128 to determine that thepreferred data type for text is unicode. The database manager 125 thenlooks at the database schema 124 to determine which column in table 700Ahas the “name” information in “unicode” format. Column 810 is theappropriate column, so data from column 810 is returned. In this manner,data may be stored in multiple data types (or formats) in the preferredembodiments to allow efficiently returning data in a format that therequesting application expects.

We now assume for our example that the percentage of accesses by Javaapplications to one or more columns in table 700A of FIG. 8 rises abovethe upper threshold level of 80%. In this case, the database managerdeletes the original columns 710 and 720 from table 700A, and renamesthe columns 810 and 820 to the names of the original columns (i.e.,“name” and “id”), as shown in table 700B in FIG. 9. Now, when a COBOLapplication requests access to the “name” column in table 700 byinvoking a COBOL API (e.g., COBOL API 332 of FIG. 3), the databasemanager 125 detects that a COBOL API was invoked, and looks at the dataaccess rules 128 to determine that the preferred data type for text ischar. The database manager 125 then looks at the database schema 124 todetermine if a column in table 700B has the “name” information in “char”format. There is no column in table 700 that has the “name” informationin “char” format, so the database manager 125 retrieves the unicode(20)data from column 810, converts the unicode(20) data to char(20) data,and returns the char(20) data to the requesting COBOL application. Insimilar fashion, a Java application requests access to the “name” columnin table 700B by invoking a Java API (e.g., Java API 334 of FIG. 3). Thedatabase manager 125 detects that a Java API was invoked, and looks atthe data access rules 128 to determine that the preferred data type fortext for a Java request is unicode. The database manager 125 then looksat the database schema 124 to determine if a column in table 700B hasthe “name” information in “unicode” format. Column 810 has the “name”information in “unicode” format, so data from column 810 is returned. Inthis manner, the database optimizer 127 causes the database 123 toautomatically evolve according to the types of applications requestingdata from the database. Note that the data optimizer 127 may operate onthe column level, which means that accesses may be tracked to particularcolumns, and appropriate changes as described in method 400 of FIG. 4may be performed on individual columns in a table without affectingother columns in the table. This allows the optimizer 127 to only builda reflective column if the run-time statistics 129 indicate a percentageof accesses to that particular column exceeds the first threshold level.

One significant advantage of the preferred embodiments is the definitionof different levels of metadata. In the prior art, the databaseadministrator may review metadata that shows how data is stored in thedatabase. The metadata is representative of the database schema. In theprior art, metadata for a database is typically stored in two tables,one to track the tables in the database, and another to track columns inthe tables. Referring to FIG. 10, table 1000 is table that shows a verysimplified representation of metadata for tables in a simple database.Each entry in table 1000 represents a different table in the database.The SchemaName column references the schema for that particular table.The TableName column contains the name of the table. The TableOwnercolumn identifies who the owner is for the table. The ColumnCount columnindicates how many columns are in the table. For the simple table 1000in FIG. 10, two tables are shown, the first being the Employee table 700of FIG. 7, and the second being a table called Table3. Of course, othertables can also exist within the database represented by the metadata intable 1000. The Employee table is assumed to have a schema labeled“EmpSchema”, and its owner is specified by the user profile of theowner. As shown in table 700 of FIG. 7, the Employee table has twocolumns.

Referring now to FIG. 11, a table 1100 represents a table of columnsthat shows metadata for the columns in the Employee table, but couldalso include metadata for columns in other tables as well. As shown inFIG. 11, the “name” column in the Employee table 700 has a data type ofchar and a size of 20. The “id” column in the Employee table 700 has adata type of packed decimal and a size of (6,0). The table of tables1000 in FIG. 10 and the table of columns 1100 in FIG. 11 together makeup metadata as known in the prior art that describes the employee tableof FIG. 7.

In the preferred embodiments, the metadata is changed to accommodate thepossibility of adding reflective columns. For table 700 of FIG. 7, thepreferred embodiments have a table of tables 1200 shown in FIG. 12 and atable of columns 1300 shown in FIG. 13. The table of tables 1200includes an additional column 1210 when compared to the prior art tableof tables 1000 in FIG. 10 that contains a flag to indicate whether ornot the table contains reflective columns. Table 700 of FIG. 7 does notcontain reflective columns, so this flag is set to FALSE for theEmployee table. The table of columns 1300 in FIG. 13 contains three newcolumns 1310, 1320 and 1330. Column 1310 contains a flag that indicateswhether the column is a reflective column. Column 1320 contains the nameof the column for which this column is reflective, or contains “null” ifthe column is not a reflective column. Column 1330 contains the name ofa paradigm for which this column is optimized. Because table 700 in FIG.7 was created by a COBOL application that prefers char(20) and packeddecimal(6,0) format, the column was optimized for the COBOL programmingparadigm. Note that for both the name and id columns in table 700 ofFIG. 7, the metadata in the table of columns 1300 of FIG. 13 specifiesthat neither of these columns is reflective, and that they are optimizedfor COBOL.

Now we examine how the metadata changes when reflective columns areadded. Table 700A in FIG. 8 contains reflective columns. The table oftables 1400 in FIG. 14 and the table of columns 1500 in FIG. 15represent metadata that describes the table 700A in FIG. 8. Thus, column1210 in the table of tables 1400 of FIG. 14 is set to TRUE for theEmployee table to indicate that the Employee table includes reflectivecolumns. Now we analyze the differences in the metadata in the table ofcolumns 1500 shown in FIG. 15, which represents table 700A of FIG. 8.Because the “name” and “id” columns both have reflective columns“name_ref” and “id_ref”, respectively, column 1310 in table 1500 of FIG.15 are true for all columns in the Employee table. Column 1320 specifieswhich column in the table is the reflective column, and column 1330specifies which paradigm the column is optimized for. Note that thenames “name_ref” and “id_ref” are shown herein as examples of suitablenames that would easily correlate reflective columns by the addition ofa “ref” suffix. However, the actual names of reflective columns wouldpreferably be assigned by the database optimizer 127 in a way that wouldminimize the likelihood that such a name would be explicitly created bya user. For example, a suffix of “##$##” could be assigned by thedatabase optimizer for reflective columns, which would minimize thelikelihood of a user creating a column of this name. By providingmetadata as shown in the table of tables 1400 and table of columns 1500,the database optimizer 127 may easily keep track of when reflectivecolumns are present in a table, and which columns contain data in whichdata types. This allows the database optimizer to return the data typethat matches the requesting application, when possible, when reflectivecolumns exist.

The table of tables 1600 in FIG. 16 and the table of columns 1700 inFIG. 17 represent metadata that describes the table 700B in FIG. 9. Notethat in table 700B the original columns 710 and 720 have been deleted,and the newer, Java-optimized columns 810 and 820 have been renamed tothe names of the original columns. These changes are reflected in themetadata for this table. First, column 1210 in table 1600 thatcorresponds to the Employee table is set to False to indicate that noreflective columns are present in the Employee table. Next, the twoentries in the table of columns 1700 relating to the COBOL data typeshave been deleted, and the column name of the remaining Java data typeshave been renamed to the original column names. Column 1310 for each ofthe remaining columns is False because there are no longer reflectivecolumns in the table. Column 1320 for each of the remaining columns isNull, and column 1330 specifies that these columns are optimized for theJava programming paradigm.

The simple example presented herein that shows the evolution of table700 in FIG. 7 to table 700A of FIG. 8 to table 700B of FIG. 9, alongwith the associated metadata and its changes in FIGS. 12-17, shows howthe database optimizer 127 may cause a database to dynamically evolveaccording to which applications are accessing columns in the databaseand with what frequency. A database table optimized for COBOL (e.g.,table 700 in FIG. 7) may be transformed to include reflective columnsthat provide increased performance for Java applications that need toaccess data in the table. Note that this increase in performance comesat the expense of additional storage space in the database. This isanother feature that is subject to potential customization by a databaseadministrator. For example, the database administrator could specify amaximum database size, or percent of database growth, that could be usedfor reflective columns. The database optimizer 127 could then createreflective columns so long as the maximum is not exceeded. In the casethat creating additional reflective columns would exceed the specifiedmaximum, the database optimizer 127 could also include heuristics todetermine whether the current optimization of potentially addingreflective columns outweighs the benefits of reflective columns thatcurrently exist, and can therefore delete some current reflectivecolumns to make room for new reflective columns.

Once the percentage of accesses by Java applications exceeds a secondthreshold level, it is deemed that the benefit of the reflective columnsis outweighed by the cost of the additional required storage, so theoriginal columns are deleted, and the remaining columns are renamed tothe names of the original columns. Now Java applications can directlyaccess the data in the format they prefer, while COBOL applications willsuffer the performance penalty of having the database manager performthe conversion between the Java-optimized data types and theCOBOL-optimized data types. Note also that this evolution can work inboth directions. If the database table evolves from table 700 in FIG. 7to table 700A to FIG. 8 to table 700B in FIG. 9, this is not necessarilythe end of the evolution. Let's assume that there are more COBOLapplications added that access the data. In the alternative, let'sassume that fewer Java applications access the data, perhaps because thedata is available in a different table. Whatever the reason, if thepercentage of accesses by COBOL applications exceeds the first thresholdlevel (e.g., of 20%), reflective columns can be added to the table, andif the percentage of accesses by COBOL applications exceeds the secondthreshold level, the reflective columns could be deleted and the tablewould devolve back to its original state, as shown in table 700 in FIG.7. The database optimizer of the preferred embodiments thus provides away to dynamically tune a database according to the applicationsaccessing data in the database.

The presence of reflective columns in a database table presents someinteresting issues. First of all, reflective columns give rise to twodifferent levels of metadata not know in the prior art. In the priorart, a database user is able to view metadata for a database that showsthe user the structure of the database. The present invention includesthe concept of multiple layers of metadata. The two levels of metadatathat may be viewed according to the preferred embodiments are referredto herein as “application view” and “system view”. The system viewmetadata shows all of the data stored in the database, similar to thedisplay of metadata known in the prior art. However, a new level ofmetadata referred to herein as “application view metadata” is apresentation of metadata from the point of view of a particular type ofapplication. As discussed in detail above, the database optimizer 127may add reflective columns to a database to improve the performance ofapplications that access data in the database. The presence of thereflective columns may not be terribly important to an applicationdeveloper, because it represents changes made to the database by thedatabase manager 125 to enhance the performance of accessing data in thedatabase, but does not affect the logic within the application. When theapplication developer requests to see metadata for the database, themetadata he or she may really be interested in may be the metadata thatapplies to the application view. In other words, the metadata that isspecific to the type of application may be displayed, while the metadatathat is specific to other types of applications may be hidden from view.In the preferred embodiments, a database administrator may determinewhether the application view metadata or the system view metadata isdisplayed when the user requests the display of metadata for thedatabase by specifying a customization setting. One example of such acustomization setting is shown in the menu display window 1800 of FIG.18, which gives the database administrator the ability to check eitherthe “Application View” box or the “System View” box. With the “SystemView” box checked as shown in FIG. 18, the system view metadata isdisplayed to the database administrator when the function to display themetadata is invoked. If the “Application View” box is checked, only themetadata relating to a specific type of application is displayed, andany reflective columns for other types of applications will remainhidden.

Another interesting issue that arises due to the presence of reflectivecolumns is the issue of data coherency between reflective columns. Howcan we assure that the data in two reflective columns are in sync? Thereare many ways to assure the data coherency between reflective columns.The first is to simply perform a write to both reflective columns whendata in either is changed. Because read operations in a databasetypically far outnumber write operations, the performance penalty forhaving to write to two columns instead of one will be small. Another wayto assure data coherency is to allow for one column to be marginally outof date while a background process copies the data from one reflectivecolumn to the other. Another way to assure data coherency is to define a“master column”, which would ensure that one column would always be upto date, while the other could be marginally out of date. This is reallya combination of the first two methods discussed above. If a write tothe database changes a column that is not the master column, animmediate write to the master column will be performed to keep themaster column up to date. If a write to the database changes the mastercolumn, other reflective column(s) may be marginally out of date andupdated by a background process. Finally, another way to assure datacoherency between reflective columns is to flag a column as dirty if itsdata is out of date. This allows for only a single update to happenimmediately to one column, and the update to the reflective column maybe done separately. However, if an application reads a column that hasits dirty flag set (indicating it is out of date with the other column),the column will be immediately updated and the dirty bit cleared beforedoing the read. Of course, there are other methods that could also beused to assure data coherency between reflective columns in a database.The preferred embodiments expressly extend to any and all methods formaintaining data coherency between reflective columns in a database.

One significant advantage of the preferred embodiments is that thedatabase manager 125 returns data from the database in a format (i.e.,of a data type) that matches the type of the requesting application, asdetermined by the API invoked to access the data, without performing asmany conversions between data types as is required by the prior art. Inthe prior art, conversion between different data types is requiredwhenever the data is stored in a format that is different than theformat preferred by the requesting application. Referring to FIG. 19, amethod 1900 in accordance with the preferred embodiments begins whenaccess to data is requested by an application (step 1910). The databasemanager then determines the type of data preferred by the requestingapplication (step 1920). In one suitable implementation, this step isbroken down into the steps of determining which API was invoked, andreferring to the data access rules 128 to determine which data type ispreferred for the type of API that was invoked. Once the preferred datatype is determined in step 1920, the requested data is retrieved fromthe database (step 1930). If the retrieved data is in the preferredformat (step 1940=YES), the data is returned to the requestingapplication without modification (step 1960). If the retrieved data isnot in the preferred format (step 1940=NO), the data is converted to thepreferred format (step 1950), and is returned to the requestingapplication (step 1960). A significant advantage of the preferredembodiments is to decrease the number of times conversion between datatypes is needed by changing the database to store data in a formatoptimized for the type of application that most frequently accesses thedata. Thus, using the apparatus and method of the preferred embodiments,the performance of a database is increased by changing the database toreduce the number of times conversion is required between data types(e.g., in step 1950 of FIG. 19).

The database optimizer 127 has been described extensively above. Many ofits features may be summarized by the block diagram shown in FIG. 20.Database optimizer 127 includes a data access mechanism 2010 thatperforms the function of method 400 of FIG. 4, which allows a databaseto evolve according to the type of applications that access it. Dataaccess mechanism 2010 includes a database modification mechanism 2012that modifies the database schema to provide better performance.Database optimizer 127 also writes metadata 2020 (as shown in FIGS.12-17) that includes reflective column fields 2022 that account forreflective columns, when present. Database optimizer 127 also includescustomization settings 2030 that allow a system administrator tocustomize the function of the database optimizer 127. Several examplesof suitable customization are described above, including thoseillustrated in FIGS. 5, 6, and 18. Data coherency mechanism 2040 is amechanism that maintains coherency of data between reflective columns,as explained in detail above. Data type conversion mechanism 2050 is amechanism that performs required conversions between data types beforedelivering data to the requesting applications, as described in method1900 of FIG. 19. Run-time statistics gathering mechanism 2060 is amechanism that tracks the frequency of accesses to columns in a databaseby different types of applications, and stores this information inrun-time statistics 129 in FIG. 1. The block diagram of FIG. 20 showsthat the database optimizer of the preferred embodiments includes manyfeatures not known in the prior art that provide significant performanceadvantages when compared to prior art techniques for accessing data in adatabase.

The preferred embodiments described herein allow a database todynamically change over time to accommodate the applications accessingit. As shown by the simple example presented in FIGS. 7-9, a databasetable may be created with columns that are of data types that areoptimized for a COBOL application. When other applications, such as Javaapplications, access a column, the data in the COBOL-optimized data typeis converted to data in the Java-optimized data type before returningthe data to the requesting Java application. This allows the Javaapplications to know they will receive data of the proper type,relieving the application of the chore of performing conversion betweendata types. As the number of accesses by Java applications passes afirst threshold level, the database schema may be changed to addreflective columns that provide data in multiple data types. At thepoint in time when the number of accesses by Java applications exceeds asecond threshold level, the original columns that contain theCOBOL-optimized data types may be deleted from the database, and onlythe columns with the Java-optimized data type. When COBOL applicationsaccess a column in this evolved database, the data in the Java-optimizeddata type is converted to data in the COBOL-optimized data type beforereturning the data to the requesting COBOL application. The first andsecond threshold may be set at variable levels according to the needs ofthe database customer. In this manner, the database may be dynamicallytuned to optimize the performance of applications that access thedatabase most frequently. By changing the first and second thresholdlevels, the database designer may trade off the performance penalty ofconverting data between data types with the space required to storereflective columns in the database.

One skilled in the art will appreciate that many variations are possiblewithin the scope of the present invention. Thus, while the invention hasbeen particularly shown and described with reference to preferredembodiments thereof, it will be understood by those skilled in the artthat these and other changes in form and details may be made thereinwithout departing from the spirit and scope of the invention.

1. An apparatus comprising: at least one processor; a memory coupled tothe at least one processor; and a database optimizer residing in thememory and executed by the at least one processor, the databaseoptimizer using statistics regarding the type of applications accessingdata in a database, the frequency with which the applications access thedata, and the location of the data being accessed by the applications tomake at least one change to the database schema to optimize theperformance of accessing data in the database.
 2. The apparatus of claim1 wherein the database optimizer makes the change to the database schemaaccording to a set of rules that specify a preferred data type for eachtype of application accessing data in the database.
 3. The apparatus ofclaim 1 wherein the change to the database schema comprises changing thedata type of at least one column in the database.
 4. The apparatus ofclaim 1 wherein the change to the database schema comprises adding a newcolumn of a second data type to the database that contains the same datain an existing column of a first data type in the database.
 5. Theapparatus of claim 1 wherein the database optimizer receives requestsfrom at least one application to access data in the database, andreturns data from the database of a data type that is expected by therequesting application.
 6. The apparatus of claim 1 wherein the databaseoptimizer further comprises a run-time statistics gathering mechanism togather the statistics.
 7. The apparatus of claim 1 wherein the databaseoptimizer operates according to customization settings set by a humanuser.
 8. The apparatus of claim 1 wherein the database optimizer furthercomprises a data type conversion mechanism that converts data in a firstdata type retrieved from the database to a second data type that ispreferred by an application requesting the data.
 9. A computer-readableprogram product comprising: (A) a database optimizer that usesstatistics regarding the type of applications accessing data in adatabase, the frequency with which the applications access the data, andthe location of the data being accessed by the applications to make atleast one change to the database schema to optimize the performance ofaccessing data in the database; and (B) recordable media bearing thedatabase optimizer.
 10. The program product of claim 9 wherein thedatabase optimizer makes the change to the database schema according toa set of rules that specify a preferred data type for each type ofapplication accessing data in the database.
 11. The program product ofclaim 9 wherein the change to the database schema comprises changing thedata type of at least one column in the database.
 12. The programproduct of claim 9 wherein the change to the database schema comprisesadding a new column of a second data type to the database that containsthe same data in an existing column of a first data type in thedatabase.
 13. The program product of claim 12 wherein the databaseoptimizer further comprises a data coherency mechanism for maintainingcoherency between the existing column and the new column.
 14. Theprogram product of claim 9 wherein the database optimizer receivesrequests from at least one application to access data in the database,and returns data from the database of a data type that is expected bythe requesting application.
 15. The program product of claim 9 whereinthe database optimizer further comprises a run-time statistics gatheringmechanism to gather the statistics.
 16. The program product of claim 9wherein the database optimizer operates according to customizationsettings set by a human user.
 17. The program product of claim 9 whereinthe database optimizer further comprises a data type conversionmechanism that converts data in a first data type retrieved from thedatabase to a second data type that is preferred by an applicationrequesting the data.
 18. An apparatus comprising: at least oneprocessor; a memory coupled to the at least one processor; and adatabase optimizer residing in the memory and executed by the at leastone processor, the database optimizer using statistics regarding thetype of applications accessing data in a database, the frequency withwhich the applications access the data, and the location of the databeing accessed by the applications to make at least one change to thedatabase schema to optimize the performance of accessing data in thedatabase, wherein the database optimizer makes the change to thedatabase schema according to a set of rules that specify a preferreddata type for each type of application accessing data in the database,wherein the change to the database schema comprises adding a new columnof a second data type to the database that contains the same data in anexisting column of a first data type in the database, wherein thedatabase optimizer further comprises a data coherency mechanism formaintaining coherency between the existing column and the new column,wherein the database optimizer receives requests from at least oneapplication to access data in the database, and returns data from thedatabase of a data type that is expected by the requesting application,wherein the database optimizer further comprises a run-time statisticsgathering mechanism to gather the statistics, wherein the databaseoptimizer operates according to customization settings set by a humanuser, wherein the database optimizer further comprises a data typeconversion mechanism that converts data in a first data type retrievedfrom the database to a second data type that is preferred by anapplication requesting the data.